Conversation


BBC-Esq commented Apr 5, 2025

You're already using the best LangChain PDF loader, `PyMuPDFParser`. However, the default `mode` parameter is "page", which means it extracts the text from each page of the PDF (along with its page metadata), and each page is then split by the `RecursiveCharacterTextSplitter`. The "page" mode within LangChain uses the `get_text` method from pymupdf, which extracts text from a single page of the PDF.

LangChain's other option is "single" mode, which also uses `get_text` from pymupdf but then concatenates everything into one document. The huge drawback is that you lose the page metadata...and forget about trying to assign page metadata to each "chunk"...
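
For reference, the two stock modes look roughly like this (`example.pdf` is a placeholder; the `mode` parameter exists on recent langchain-community versions):

```python
from langchain_community.document_loaders import PyMuPDFLoader

# "page" mode (the default): one Document per page, each carrying page metadata.
page_docs = PyMuPDFLoader("example.pdf", mode="page").load()
print(page_docs[0].metadata.get("page"))  # per-page metadata survives

# "single" mode: everything concatenated into one Document; the per-page
# metadata is gone, so chunks can no longer be traced back to a page.
single_doc = PyMuPDFLoader("example.pdf", mode="single").load()
```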

This PR solves that issue, ultimately allowing for accurate "page citations" in a user's application.

It uses custom loader/parser classes that follow these steps (see the sketch after this list):

  1. Still uses page mode, but prepends a unique page marker to the text extracted from each page (e.g., [[page1]], [[page2]], and so on).
  2. Concatenates all of the text.
  3. Creates a "clean" copy of the entire concatenated text WITHOUT the page markers.
  4. Splits the "clean" text.
  5. For each chunk, it uses a regex to search the concatenated text WITH the page markers, determining where each chunk begins by looking for the first page marker PRIOR to that chunk.
  6. Assigns accurate page metadata for each chunk.
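
Here's a minimal, self-contained sketch of those six steps. The function name, marker format, and metadata keys are illustrative, not the PR's exact code; it assumes pymupdf >= 1.24 (for the `pymupdf` import name) and langchain-text-splitters are installed:

```python
import re
from bisect import bisect_right

import pymupdf  # PyMuPDF
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# The marker just needs to be unique enough not to collide with real PDF text.
MARKER_RE = re.compile(r"\[\[page(\d+)\]\]")


def chunk_pdf_with_page_metadata(path, chunk_size=1000, chunk_overlap=200):
    # Steps 1-2: prepend a marker to each page's text, then concatenate.
    pdf = pymupdf.open(path)
    marked = "".join(f"[[page{i + 1}]]" + page.get_text() for i, page in enumerate(pdf))

    # Step 3: a "clean" copy of the concatenated text WITHOUT the markers.
    clean = MARKER_RE.sub("", marked)

    # Build a lookup table from the regex matches against the *marked* text:
    # where each page begins in the *clean* text, and its page number.
    starts, pages, removed = [], [], 0
    for m in MARKER_RE.finditer(marked):
        removed += len(m.group(0))
        starts.append(m.end() - removed)
        pages.append(int(m.group(1)))

    # Step 4: split the clean text; chunks are free to span page boundaries.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )

    # Steps 5-6: locate each chunk's offset in the clean text and assign the
    # page whose marker is the last one before that offset.
    results, search_from = [], 0
    for chunk in splitter.split_text(clean):
        offset = clean.find(chunk, search_from)
        if offset == -1:  # shouldn't happen; fall back to a fresh search
            offset = clean.find(chunk)
        # Overlapping chunks start before the previous chunk ends, so only
        # advance the search position by one character.
        search_from = offset + 1
        page = pages[bisect_right(starts, offset) - 1]
        results.append(Document(page_content=chunk, metadata={"page": page, "source": path}))
    return results
```

Usage is then a one-liner, e.g. `docs = chunk_pdf_with_page_metadata("example.pdf", chunk_size=1500)`, and every returned chunk carries the page number where it begins.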

The benefit of this is that chunks of text are no longer artificially split at the page boundaries of the PDF itself; i.e., chunks can span pages.

The only "parser" from langchain does does this "out of the box" is pdfminer, but it's insanely slow. Therefore, I highly recommend using this custom pymupdf approach instead.

Moreover, it more accurately respects the `chunk_size` parameter. For example, if a particular page of a PDF only has 200 characters, the default "page" mode will give you a chunk of 200 characters even if you set `chunk_size` to 1,000,000. The custom approach allows chunks to extend across pages, obviating this problem.

Many embedding models can now handle chunk sizes well above the standard 512 characters, and there are use cases for that...but overall it's just better to have accurate page metadata for each "chunk" after processing the entire concatenated text...

BBC-Esq changed the title from "improve pymupdfparser" to "improve extracting PDF text" on Apr 5, 2025

BBC-Esq commented May 14, 2025

Can someone look at this?

BBC-Esq closed this by deleting the head repository on Jul 24, 2025